Automatic Thesaurus Generation using Co-occurrence

Authors

  • Rogier Brussee
  • Christian Wartena
Abstract

This paper proposes a characterization of useful thesaurus terms by the informativity of co-occurrence with that term. Given a corpus of documents, informativity is formalized as the information gain of the weighted average term distribution of all documents containing that term. Although the resulting algorithm for thesaurus generation is unsupervised, we find that high-informativity terms correspond to large and coherent subsets of documents. We evaluate our method on a set of Dutch Wikipedia articles by comparing high-informativity terms with keywords for the Wikipedia category of the articles.
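
The abstract does not give the exact formula, so the following is only a minimal sketch of one plausible reading. It assumes that "information gain" can be approximated by the Kullback-Leibler divergence between (a) the average term distribution of the documents containing a term, weighted by how often the term occurs in each document, and (b) the term distribution of the whole corpus. The function names `term_distribution` and `informativity`, the weighting scheme, and the toy corpus are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def term_distribution(counts):
    """Normalize raw term counts into a probability distribution."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def informativity(term, docs):
    """Assumed informativity score for `term` (hypothetical formulation).

    docs: list of documents, each a list of tokens.
    Returns the KL divergence (in bits) between the weighted average term
    distribution of the documents containing `term` and the corpus-wide
    term distribution. Returns 0.0 if the term never occurs.
    """
    doc_counts = [Counter(doc) for doc in docs]

    # Background distribution: term frequencies over the whole corpus.
    background = Counter()
    for counts in doc_counts:
        background.update(counts)
    q = term_distribution(background)

    # Assumption: each document is weighted by its share of the
    # occurrences of `term`.
    weights = [counts[term] for counts in doc_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0

    # Weighted average of the per-document term distributions.
    p = defaultdict(float)
    for counts, weight in zip(doc_counts, weights):
        if weight == 0:
            continue
        for word, prob in term_distribution(counts).items():
            p[word] += (weight / total_weight) * prob

    # KL divergence D(p || q); every word in p also occurs in q.
    return sum(pw * math.log2(pw / q[word]) for word, pw in p.items() if pw > 0)


if __name__ == "__main__":
    toy_corpus = [["a", "b"], ["a", "c"]]
    for t in ["a", "b", "z"]:
        print(t, informativity(t, toy_corpus))
```

Under this reading, a term that appears evenly across the whole corpus scores close to zero, because the documents containing it look like the corpus as a whole, while a term confined to a distinctive subset of documents scores higher.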

Similar resources

Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus

Fully Automatic Thesaurus Generation (ATG) seeks to generate useful thesauri by mining a corpus of raw text. A number of statistical approaches based on term co-occurrence exist for this, but in general they can only estimate the strength of the relationship between two terms, not its nature. In this paper we implement Hearst's method of discovering the hyponymy relations which are t...

Ad Hoc Retrieval Experiments Using WordNet and Automatically Constructed Thesauri

This paper describes our method in the automatic ad hoc task of TREC-7. We propose a method to improve the performance of an information retrieval system by expanding the query using three different types of thesaurus. The expansion terms are taken from a handcrafted thesaurus (WordNet), a co-occurrence-based automatically constructed thesaurus, and a syntactically predicate-argument based automatically constructe...

Alleviating Search Uncertainty Through Concept Associations: Automatic Indexing, Co-Occurrence Analysis, and Parallel Computing

In this article, we report research on an algorithmic approach to alleviating search uncertainty in a large information space. Grounded on object filtering, automatic indexing, and co-occurrence analysis, we performed a... gather, process, and retrieve information. These systems provide a wide variety of information and services, ranging from daily updates of foreign and national news, movie revie...

English-Japanese Cross-lingual Query Expansion Using Random Indexing of Aligned Bilingual Text Data

Vector space models can be used for extracting semantically similar words from the co-occurrence statistics of words in large text data. In this paper, we report on our NTCIR 2002 experiments using the Random Indexing vector space method for extracting an English-Japanese cross-lingual thesaurus from aligned English-Japanese bilingual data. The cross-lingual thesaurus has been used for automatic...

Graph-based Word Clustering using a Web Search Engine

Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure derived from web counts. Each pair of words is submitted as a query to a search engine, and the resulting counts form a co-occurrence matrix. By cal...

Publication date: 2008